Introduction

Session overview

In this workshop you will get practice in choosing between, performing, and presenting the results of one- and two- sample t-tests and their non-parametric equivalents in R.

Learning Outcomes

By actively following the materials and carrying out the independent study before and after the contact hours the successful student will be able to:

  • Explain the difference between dependent and independent samples (MLO 2)
  • Select, appropriately, t-tests and their non-parametric equivalents (the Wilcoxon tests) (MLO 2)
  • Apply and interpret the tests in R (MLO 3 and 4)
  • Evaluate whether the assumptions of a test are met
  • Summarise and illustrate with appropriate R figures test results scientifically (MLO 3 and 4)

Philosophy

Workshops are not a test. It is expected that you often don’t know how to start, make a lot of mistakes and need help. Do not be put off and don’t let what you can not do interfere with what you can do. You will benefit from collaborating with others and/or discussing your results. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. You may wish to refer to the independent study materials for reference.

Materials are indexed here: https://3mmarand.github.io/BIO00017C-Data-Analysis-in-R-2020/

Key

These four symbols are used at the beginning of each instruction so you know where to carry out the instruction.

W is something you need to do on your computer. It may be opening programs or documents or locating a file.

R is something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.

GC is something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.

Q is question for you to think about an answer. You will usually want to record your answers in your script for future reference.

Artwork by Allison Horst

Getting started

W Start RStudio from the Start menu.

R Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says Project: (None) and choosing New Project and then New Directory, then New Project. Navigate to the β€œdata-analysis-in-r” folder. Name the RStudio Project β€˜workshop4’.

R On the Files tab click on New Folder. In the box that appears type β€œdata”. This will be the folder that we save data files to.

R Make a new script then save it with a name like analysis.R to carry out the rest of the work.

R Load the tidyverse:

library(tidyverse)

Exercises

Adiponectin secretion

Adiponectin is exclusively secreted from adipose tissue and modulates a number of metabolic processes. Nicotinic acid can affect adiponectin secretion. 3T3-L1 adipocytes were treated with nicotinic acid or with a control treatment and adiponectin concentration (pg/mL) measured. The data are in adipocytes.txt. Each row represents an independent sample of adipocytes and the first column gives the concentration adiponectin and the second column indicates whether they were treated with nicotinic acid or not.

W Save a copy of the data file adipocytes.txt

R Read in the data and check the structure. I used the name adip for the dataframe/tibble.

#---CODING ANSWER---
# import
adip  <-  read_table2("data/adipocytes.txt")
str(adip)

We have a tibble containing two variables: adiponectin is the response and is continuous and treatment is explanatory. treatment is categorical with two levels (groups). The first task is visualise the data to get an overview. For continuous response variables with categorical explanatory variables you could use geom_point(), geom_boxplot() or a variety of other geoms. I often use geom_violin() which allows us to see the distribution - the violin is fatter where there are more data points.

R Do a quick plot of the data:

ggplot(data = adip, aes(x = treatment, y = adiponectin)) +
  geom_violin()

Top Tip

Always do a quick data visualisation before you do any analysis.

Summarising the data

Summarising the data for each treatment group is the next sensible step. The most useful summary statistics are the means, standard deviations, sample sizes and standard errors.

R Create a data frame called adipsummary that contains the means, standard deviations, sample sizes and standard errors for the control and nicotinic acid treated samples. You may want to look this up in your script from the second and third workshops.

You should get the following numbers:

treatment mean std n se
control 5.546000 1.475247 15 0.3809072
nicotinic 7.508667 1.793898 15 0.4631824

Selecting a test

Q Do you think this is a paired-sample test or two-sample test?

Applying, interpreting and reporting

R You can carry out a two-sample t-test with:

t.test(data = adip,
       adiponectin ~ treatment,
       var.equal = T)
## 
##  Two Sample t-test
## 
## data:  adiponectin by treatment
## t = -3.2728, df = 28, p-value = 0.00283
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.1910762 -0.7342571
## sample estimates:
##   mean in group control mean in group nicotinic 
##                5.546000                7.508667

Q What do you conclude from the test? Write your conclusion in a form suitable for a report.

Check assumptions

The assumptions of the two-sample t-test are that the residuals – the difference between predicted value (i.e., the group mean) and observed values - are normally distributed and have homogeneous variance. To check these we need to calculate the residuals and we can do this in two steps:

R First by adding a column that holds the mean for the group each value belongs to by β€˜merging’ the first two columns (treatment and mean) of the summary data into the raw data:

# add the group means to the data
adip <- merge(adip, adipsummary[1:2], by = "treatment")

adipsummary[1:2] is the first two columns of adipsummary: we want to add the mean and need the treatment to match the rows. We don’t need to add the other columns.

Every row where treatment in the adip dataframe matches treatment in the adipsummary dataframe, the mean, std column of adipsummary, is added to adip.

R look at the adip dataframe to see what this code has done.

R Second by adding a column for each β€˜residual’

# add the residuals
adip <- adip %>%
  mutate(residual = adiponectin - mean)

Now we are ready to examine the distribution of the residuals.

R Check the residuals are homogenously distributed (variance is the same in both groups):

ggplot(data = adip,
       aes(x = mean, y = residual)) +
  geom_point()

There’s a bit of an outlier in one group but this looks β€˜ok’.

R Check the residuals are noramlly distributed:

ggplot(data = adip,
       aes(x = residual)) +
  geom_histogram(bins = 10)

This also looks ok.

We can also use a Shapiro-Wilk normality test. The null hypothesis of this test is that the residuals follow a normal distribution.

R Run a normality test:

shapiro.test(adip$residual)
## 
##  Shapiro-Wilk normality test
## 
## data:  adip$residual
## W = 0.97562, p-value = 0.701

Usually, when we are doing statistical tests we would like the the test to be significant because it means we have evidence of a biological effect. However, when doing normality tests we hope it will not be significant. A non-significant result means that there is no significant difference between the distribution of the residuals and a normal distribution and that indicates the t-test assumptions are met.

Q What do you conclude from the result of the normality tests?

Illustrating

We are going to create a figure like this:

In this figure, we have the data points themselves which are in adip dataframe and the means and standard errors which are in the adipsummary dataframe. That is, we have two dataframes we want to plot. Here you will learn that dataframes and aesthetics can be specified within a geom_... (rather than in the ggplot()) if the geom only applies to some of the data you want to plot.

We will build the plot up in small steps but as you get more used to ggplot you’ll probably be able to create figures in fewer steps.

You can either make a new plot each time OR edit your existing ggplot() command as we go.

R First, create an empty plot:

ggplot()

R Now add the data points:

ggplot() +
  geom_point(data = adip, aes(x = treatment, y = adiponectin))

Notice how we have given the data argument and the aesthetics inside the geom. The variables treatment and adiponectin are in the adip dataframe

R So the data points don’t overlap, we can add some random jitter in the x direction:

ggplot() +
  geom_point(data = adip, aes(x = treatment, y = adiponectin),
             position = position_jitter(width = 0.1, height = 0))

We’ve set the vertical jitter to 0 because, in contrast to the categorical x-axis, movement on the y-axis has meaning (adiponectin).

R Let’s make the points a light grey:

ggplot() +
  geom_point(data = adip, aes(x = treatment, y = adiponectin),
             position = position_jitter(width = 0.1, height = 0),
             colour = "grey50")

Now to add the errorbars. These go from one standard error below the mean to one standard error above the mean.

R Add a geom_errorbar() for errorbars:

ggplot() +
  geom_point(data = adip, aes(x = treatment, y = adiponectin),
             position = position_jitter(width = 0.1, height = 0),
             colour = "grey50") +
  geom_errorbar(data = adipsummary, 
                aes(x = treatment, ymin = mean - se, ymax = mean + se),
                width = 0.3) 

We have specified the adipsummary dataframe and the variables treatment, mean and se are in that.

There are several ways you could add the mean. You could use geom_point() but I like to use geom_errorbar() again with the ymin and ymax both set to the mean.

R Add a geom_errorbar() for the mean:

ggplot() +
  geom_point(data = adip, aes(x = treatment, y = adiponectin),
             position = position_jitter(width = 0.1, height = 0),
             colour = "grey50") +
  geom_errorbar(data = adipsummary, 
                aes(x = treatment, ymin = mean - se, ymax = mean + se),
                width = 0.3) +
  geom_errorbar(data = adipsummary, 
                aes(x = treatment, ymin = mean, ymax = mean),
                width = 0.2)

R Alter the axis labels and limits:

ggplot() +
  geom_point(data = adip, aes(x = treatment, y = adiponectin),
             position = position_jitter(width = 0.1, height = 0),
             colour = "grey50") +
  geom_errorbar(data = adipsummary, 
                aes(x = treatment, ymin = mean - se, ymax = mean + se),
                width = 0.3) +
  geom_errorbar(data = adipsummary, 
                aes(x = treatment, ymin = mean, ymax = mean),
                width = 0.2) +
  scale_y_continuous(name = "Adiponectin (pg/mL)", 
                     limits = c(0, 12), 
                     expand = c(0, 0)) +
  scale_x_discrete(name = "Treatment", labels = c("Control", "Nicotinic acid"))

R Format the figure in a way that is more suitable for including in a report:

ggplot() +
  geom_point(data = adip, aes(x = treatment, y = adiponectin),
             position = position_jitter(width = 0.1, height = 0),
             colour = "gray50") +
  geom_errorbar(data = adipsummary, 
                aes(x = treatment, ymin = mean - se, ymax = mean + se),
                width = 0.3) +
  geom_errorbar(data = adipsummary, 
                aes(x = treatment, ymin = mean, ymax = mean),
                width = 0.2) +
  scale_y_continuous(name = "Adiponectin (pg/mL)", 
                     limits = c(0, 12), 
                     expand = c(0, 0)) +
  theme_classic()

Top Tip

The code required to summarise, test, and plot data for any two-sample t-test AND for any for any one-way ANOVA is exactly the same except for the names of the dataframe, variables and the axis labels and limits. Take some time to comment it. Consider making a script called ttest.R or similar with all the code and information you need to reuse it.

Later in the module, you’ll learn how to put annotation on the figure like this:

Grouse Parasites

These data come from a sample of grouse shot in Scotland. The grouse livers were dissected and the number of individuals of a parasitic nematode were counted for two estates β€˜Gordon’ and β€˜Moss’. We want to know if the two estates have different infection rates.

gordon: 5, 16, 8, 64, 51, 11, 9, 7, 43, 49

moss: 0, 2, 1, 3, 6, 10, 4, 12, 19, 20

R Create dataframe for these data. I put it in dataframe in two columns (which allowed me to copy and paste the values) then used pivot_longer()

You are aim for this:

Selecting

Q Using your common sense, do these data look normally distributed?

Q What test do you suggest?

Applying, interpreting and reporting

R Summarise the data by finding the median of each group:

R Carry out a two-sample Wilcoxon test (also known as a Mann-Whitney):

wilcox.test(data = grouse, nematodes ~ estate)
## 
##  Wilcoxon rank sum exact test
## 
## data:  nematodes by estate
## W = 78, p-value = 0.03546
## alternative hypothesis: true location shift is not equal to 0

Q What do you conclude from the test? Write your conclusion in a form suitable for a report.

Illustrating

A box plot is a usually good choice for illustrating a two-sample Wilcoxon test because it shows the median and interquartile range.

R We can create a simple boxplot with:

ggplot(data = grouse, aes(x = estate, y = nematodes) ) +
  geom_boxplot() 

R Format the figure so it is more suitable for a report.

Gene Expression

Researchers are interested in the expression levels of a particular set of 35 E.coli genes in response to heat stress. They measure the expression of the genes at 37 and 42 degrees C (labelled low and high temperature). These samples are not independent because we might expect there to be a relationship between expression at 37 and 42 degrees within a gene.

Selecting

Q What is the null hypothesis?

W Save a copy of coliexp.txt

R Read the data in

Q What is the appropriate parametric test?

Applying, interpreting and reporting

R Now carry out a paired-sample t-test.

t.test(data = coliexp, expression ~ temperature, paired = T)
## 
##  Paired t-test
## 
## data:  expression by temperature
## t = 3.2039, df = 34, p-value = 0.002943
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1123712 0.5021946
## sample estimates:
## mean of the differences 
##               0.3072829

Q State your conclusion from the test in a form suitable for including in a report. Make sure you give the direction of any significant effect.

πŸŽ‰ Well Done! πŸŽ‰

Independent study following the workshop

Decide which test you should use to analyse the each following data sets. In each case give the reasons for your choice of test and state the null hypothesis. Write your conclusions in a form suitable for including in a report. Can you make figures?

1. Plant Biotech

Some plant biotechnologists are trying to increase the quantity of omega 3 fatty acids in Cannabis sativa. They have developed a genetically modified line using genes from Linum usitatissimum (linseed). They grow 50 wild type and fifty modified plants to maturity, collect the seeds and determine the amount of omega 3 fatty acids. The data are in csativa.txt. Do you think their modification has been successful?

2. Sheep diet

In order to investigate the effects of feeding fertilised grass to sheep, one of each pair of fourteen sets of twins was fed fertilised grass whilst the other was fed unfertilised grass and the adult weight of the sheep was recorded. The data are in sheep.txt . Is there difference in the effect of fertilised and unfertilised grass on sheep weight?

The Code files

These contain all the code needed in the workshop even where it is not visible on the webpage.

Rmd file The Rmd file is the file I use to compile the practical. Rmd stands for R markdown. It allows R code and ordinary text to be interweaved to produce well-formatted reports including webpages. If you right-click on the link and choose Save-As, you will be able to open the Rmd file in RStudio. Alternatively, View in Browser.

Plain script file This is plain script (.R) version of the practical generated from the Rmd. Again, you can save this and open it RStudio. Alternatively, View in Browser.

Pages made with rmarkdown (Allaire Xie, et al., 2019a; Xie Allaire, et al., 2018a), kableExtra (Zhu, 2019a), RefManager (McLean, 2014)

References

Allaire, J., Y. Xie, et al. (2019a). rmarkdown: Dynamic Documents for R. R package version 1.16. URL: https://github.com/rstudio/rmarkdown.

McLean, M. W. (2014). Straightforward Bibliography Management in R Using the RefManager Package. arXiv: 1403.2036 [cs.DL]. URL: https://arxiv.org/abs/1403.2036.

Xie, Y., J. Allaire, et al. (2018a). R Markdown: The Definitive Guide. ISBN 9781138359338. Boca Raton, Florida: Chapman and Hall/CRC. URL: https://bookdown.org/yihui/rmarkdown.

Zhu, H. (2019a). kableExtra: Construct Complex Table with β€˜kable’ and Pipe Syntax. R package version 1.1.0. URL: https://CRAN.R-project.org/package=kableExtra.